Unify SIMD arithmetic under a shared transform_binary template #740

yungyuc merged 2 commits into solvcon:master
Conversation
488d5cd to 11a6eb1
tigercosmos left a comment
@yungyuc The PR is ready for review. Thanks!
```cpp
// every correctness check. Kept under an underscore-prefixed name
// because detect_simd() only meaningfully reflects the dispatched
// backend on aarch64 today; on other targets it would mislead users.
mod.def("_simd_feature", &simd_feature_name);
```
For checking whether SIMD is working.
cpp/modmesh/toggle/ may be a more on-topic module for the SIMD check, but it's fine to have it here in buffer.
```cpp
struct vec_add
{
    template <typename V>
    static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
};
```
Key design in this PR.
```cpp
constexpr size_t N_lane = type::vector_lane<T>;
if constexpr (!std::invocable<VecOp, vec_t, vec_t>)
{
    generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
```
T does have a vector type, but the specific VecOp functor can't be called with it. For example, vdivq doesn't exist for integer vector types in NEON, so vec_div{} isn't invocable with int32x4_t.
```cpp
{
    vec_t v1 = vld1q(src1);
    vec_t v2 = vld1q(src2);
    vst1q(ptr, vec_op(v1, v2));
```
vec_op is called here.
```diff
 if constexpr (!type::has_vectype<T>)
 {
-    return generic::add<T>(dest, dest_end, src1, src2);
+    generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
```
The scalar type T itself has no corresponding NEON vector type (e.g., bool, int64_t). There's no vector register representation at all, so SIMD is impossible.
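The trait behind this gate can be sketched as follows. The names mirror the PR (`detail::vector`, `type::has_vectype`), but the specializations and lane counts below are illustrative stand-ins for the real NEON register mappings (e.g. `float32x4_t`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace detail
{

// Primary template: no vector register representation for T at all.
template <typename T>
struct vector
{
    static constexpr std::size_t N_lane = 0;
};

// Specializations for types that map to a 128-bit SIMD register.
template <>
struct vector<float>
{
    static constexpr std::size_t N_lane = 4; // 4 x 32-bit lanes
};

template <>
struct vector<std::int32_t>
{
    static constexpr std::size_t N_lane = 4;
};

} // namespace detail

namespace type
{

// bool, not size_t: the trait answers a yes/no question.
template <typename T>
inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;

template <typename T>
inline constexpr std::size_t vector_lane = detail::vector<T>::N_lane;

} // namespace type
```

Types without a specialization fall through to the primary template, get `N_lane == 0`, and are routed to the scalar path by `if constexpr (!type::has_vectype<T>)`.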
```diff
+#include <cstddef>
 #include <arm_neon.h>
-#include <cstddef>
```
The formatter fixes the order; I think it should be fine. Let me know if I should revert it.
```diff
 template <typename T>
-inline constexpr size_t has_vectype = detail::vector<T>::N_lane > 0;
+inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;
```
Fixed the boolean type.
```diff
 inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
 {
-    T * ptr = dest;
-    while (ptr < dest_end)
-    {
-        *ptr = *src1 - *src2;
-        ++ptr;
-        ++src1;
-        ++src2;
-    }
+    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
 }
```
Main design of this PR.
I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.
> I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.
Profiled on Apple M3 Pro (clang 17, -O3 -DNDEBUG -mcpu=apple-m1). Both the assembly and the throughput look clean.
Assembly. The transform_binary<T, ScalarOp, VecOp> template inlines fully — no functor calls, no extra moves, no spills. Hot-loop body for simd_add_f32:
```asm
; BEFORE (master)            ; AFTER (this PR)
ldr     q0, [x2], #16        ldr     q0, [x2], #16
ldr     q1, [x3], #16        ldr     q1, [x3], #16
fadd.4s v0, v0, v1           fadd.4s v0, v0, v1
str     q0, [x0], #16        str     q0, [x0], #16
cmp     x0, x8               subs    x8, x8, #1
b.ls    LBB0_1               b.ne    LBB0_2
```
5 instr/iter on both sides; one fused control-chain macro-op (cmp/b.ls ≡ subs/b.ne for fusion purposes). Same pattern for sub/mul/div/add<int32>/mul<int32>/add<double>. mul<int64> (scalar fallback via std::invocable<VecOp, vec_t, vec_t>) is byte-identical to master.
Throughput (Gelem/s, n=16384, L2-resident, median of 3 runs):
| Op | BEFORE | AFTER | Δ |
|---|---|---|---|
| `add<float>` | 14.91 | 14.78 | −0.9% |
| `mul<float>` | 14.82 | 14.77 | −0.4% |
| `div<float>` | 14.78 | 14.76 | −0.2% |
| `add<int32>` | 14.87 | 14.71 | −1.0% |
| `add<double>` | 5.63 | 5.69 | +1.2% |
All ops within ±2% of master across L1/L2-resident sizes — well inside run-to-run noise. Full write-up + reproducer in profiling/simd_pr740/.
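The reproducer itself lives in profiling/simd_pr740/; as a rough illustration only, a throughput measurement of this kind can be sketched like so (the function name, buffer size, and rep count below are illustrative, not the PR's actual harness):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Measure element throughput (elements/second) of a binary kernel over a
// cache-resident buffer. `reps` amortizes clock overhead; the median of
// several such runs should be reported, as in the table above.
template <typename Kernel>
double measure_throughput(Kernel kernel, std::size_t n, int reps)
{
    std::vector<float> src1(n, 1.0f);
    std::vector<float> src2(n, 2.0f);
    std::vector<float> dest(n, 0.0f);
    auto const t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
    {
        kernel(dest.data(), dest.data() + n, src1.data(), src2.data());
    }
    auto const t1 = std::chrono::steady_clock::now();
    double const seconds = std::chrono::duration<double>(t1 - t0).count();
    // Elements processed per second across all repetitions.
    return static_cast<double>(n) * static_cast<double>(reps) / seconds;
}
```

A real harness would additionally prevent the compiler from optimizing the loop away (e.g. by consuming `dest`) and pin the buffer size to the target cache level.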
> Profiled on Apple M3 Pro (clang 17, -O3 -DNDEBUG -mcpu=apple-m1). Both the assembly and the throughput look clean.
The profiling results and assembly look good. But clang 17 looks old; the latest version provided by Xcode is 21.0.0 (clang-2100.0.123.102). An old compiler is OK since both before and after use the same version.
```python
if platform.machine() in ("arm64", "aarch64"):
    self.assertEqual(feature, "NEON")
```
Check that NEON is working; we didn't test this before.
```python
        self.skipTest("_simd_feature() = " + feature)


class SimdTransformBinaryTC(unittest.TestCase):
```
Some cases for checking transform_binary functionality.
Why do you isolate this unit test out from test_buffer.py?
The whole SIMD implementation is also outside the buffer directory. I think it is worth a new file.
@KHLee529 Could you please take a look?

The unified backend looks nice at first glance. I'll dive into the details later.
- Clarify if some `for` loops can also be replaced with `while`.
- Run a performance test to compare the runtime before and after the change. List the results to show that the change does not degrade runtime performance.
- Rename `test_simd.py` to `test_buffer_simd.py`. We can discuss which name is better.
```cpp
// Vector loop runs while a full lane still fits. The remaining-count
// form keeps the condition valid for buffers shorter than one lane.
T const * ptr = start;
while (static_cast<size_t>(end - ptr) >= N_lane)

if (ptr != dest_end)

// Tail scalar loop for remaining elements
for (; ptr < end; ++ptr)
```
Since most tests are against SimpleArray, I suggest naming the new test file test_buffer_simd.py?
KHLee529 left a comment
No changes requested. Only some comments and questions are listed.
```diff
-template <typename T, typename std::enable_if_t<type::has_vectype<T>> * = nullptr>
-const T * check_between(T const * start, T const * end, T const & min_val, T const & max_val)
+template <typename V>
+static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
```
Can these operator helper functions also be inlined? Based on my experience profiling the speed of SimpleArray SIMD operations, whether the vector operations are inlined has a large impact on performance.
They are implicitly inlined, as confirmed by the profiling.
```cpp
}
while (ptr < dest_end)
{
    *ptr = scalar_op(*src1, *src2);
```
Nice way to remove the dependency on generic functions.
```diff
 {
-    T idx = *ptr;
-    if (idx < min_val || idx > max_val)
+    if (*ptr < min_val || *ptr > max_val)
```
Is this refinement potentially slower due to one more dereference?
No — clang CSEs the two textual *ptr reads into a single load per iteration. Same one ldr in both versions:
```asm
; BEFORE                        ; AFTER
ldr  w10, [x0], #4              ldr  w10, [x0]
cmp  w10, w8                    cmp  w10, w8
ccmp w10, w9, #0, ge            ccmp w10, w9, #0, ge
b.le ...                        b.gt ...
                                add  x0, x0, #4
                                cmp  x0, x1
                                b.lo ...
```
No extra dereference. In fact, when this function inlines into a hot caller the new form is measurably faster (~1.8× on a tight scan loop, M3 Pro) — the for (...; ++ptr) shape decouples the load operand from the pointer-bump, which gives the register allocator more freedom and avoids a redundant register-shuffle that the post incrementing ldr [x0], #4 triggers under inlining pressure.
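For reference, the scalar scan under discussion has this shape. This is an illustrative sketch, not the exact modmesh source: return the first element outside `[min_val, max_val]`, or `end` when all values pass:

```cpp
#include <cassert>
#include <cstddef>

// Scalar scan in the post-change shape discussed above: direct *ptr in the
// comparison, pointer bump in the for-loop header rather than a
// post-incrementing load.
template <typename T>
T const * check_between(T const * start, T const * end,
                        T const & min_val, T const & max_val)
{
    for (T const * ptr = start; ptr < end; ++ptr)
    {
        if (*ptr < min_val || *ptr > max_val)
        {
            return ptr; // first out-of-range element
        }
    }
    return end; // everything in range
}
```

Semantically the two textual `*ptr` reads are one load per iteration after common-subexpression elimination, which is the point of the assembly comparison above.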
```cpp
// Vector loop runs while a full lane still fits. Counted trip form
// for the same reason as transform_binary above: avoids UB on
// sub-lane inputs and the per-iter `sub` overhead.
size_t const blocks = static_cast<size_t>(end - start) / N_lane;
T const * ptr = start;
for (size_t block = 0; block < blocks; ++block)
```
An intermediate revision used while (dest_end - ptr >= N_lane) for the bound check. That form is UB-safe but forces clang to emit a non-flag-setting sub + cmp #12 pair, which breaks macro-op fusion on AArch64 and showed a real ~20–25% regression on cache-resident loops. The current head replaces it with a counted trip count (blocks = (dest_end - dest) / N_lane), which restores fusion. That is what the numbers above measure.
I did not know that. Good to learn.
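The counted-trip shape can be sketched without intrinsics. In this illustrative version the inner per-lane loop stands in for the vld1q/vaddq/vst1q triple, and the names are not the PR's exact ones:

```cpp
#include <cassert>
#include <cstddef>

// Counted-trip form: safe even when (dest_end - dest) < N_lane, because no
// pointer such as `dest_end - N_lane` is ever formed. The block counter
// compiles to a decrement-and-branch back-edge, which AArch64 macro-op
// fuses (subs/b.ne).
template <typename T, std::size_t N_lane = 4>
void add_blocked(T * dest, T const * dest_end, T const * src1, T const * src2)
{
    std::size_t const n = static_cast<std::size_t>(dest_end - dest);
    std::size_t const blocks = n / N_lane;
    T * ptr = dest;
    for (std::size_t block = 0; block < blocks; ++block)
    {
        // Stand-in for one full-lane load/op/store (vld1q/vaddq/vst1q).
        for (std::size_t lane = 0; lane < N_lane; ++lane)
        {
            ptr[lane] = src1[lane] + src2[lane];
        }
        ptr += N_lane;
        src1 += N_lane;
        src2 += N_lane;
    }
    // Inline scalar tail for the remaining n % N_lane elements.
    for (; ptr < dest_end; ++ptr, ++src1, ++src2)
    {
        *ptr = *src1 + *src2;
    }
}
```

With a sub-lane input, `blocks` is zero and only the tail loop runs, which is exactly the case where the old `ptr <= dest_end - N_lane` bound was undefined behavior.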
Refs solvcon#646 (Task 2). The generic and NEON backends each had four
near-identical loops for add/sub/mul/div. Collapse them into a single
transform_binary per backend that takes the operation as an injected functor.

In the generic backend, the four ops now pass std::plus / std::minus /
std::multiplies / std::divides into transform_binary. In the NEON backend they
pass vec_add / vec_sub / vec_mul / vec_div wrappers around neon_alias, and
std::invocable<VecOp, vec_t, vec_t> routes types without a matching vector
overload (e.g. int64 for vmulq) to the scalar path at compile time. This
replaces the ad-hoc vector_lane > 2 and is_floating_point_v guards.

Bugs fixed along the way:

- Sub-lane UB in NEON: `ptr <= dest_end - N_lane` formed a pointer before the
  buffer when the input was shorter than one SIMD lane. The vector loop now
  runs a counted trip count (`blocks = (dest_end - dest) / N_lane`), which is
  UB-safe on sub-lane inputs and lowers to a `subs/b.ne` macro-op-fused
  back-edge on AArch64. The scalar remainder is inline instead of a recursive
  call into generic::. The same rewrite is applied to check_between.
- check_between diagnostic: the SIMD body checked the >= max mask first and
  only looked at < min if the first was empty, so a later too-large lane could
  hide an earlier too-small one. Both bounds are now inspected before picking
  the returned pointer.
- has_vectype: declared as size_t; retyped to bool to match its predicate role.

tests/test_buffer_simd.py pins _simd_feature() == "NEON" on aarch64 so a
silent fallback to the scalar path cannot pass unnoticed. It covers the int32
shape matrix (n=1, 3, 4, 5, 8, 17) for transform_binary, the int64-mul SFINAE
fallback, and float sub/mul/div with one block + tail. A new private
modmesh._modmesh._simd_feature() binding exposes the runtime-detected backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yungyuc All comments are addressed. Please take a look, thanks!
* Ignore:
  * readability-static-definition-in-anonymous-namespace
  * misc-use-anonymous-namespace
* Add missing const
* Use auto to avoid duplicate type name
Summary

Refs #646 (Task 2). The generic and NEON backends each had four near-identical loops for add/sub/mul/div. This PR collapses them into a single `transform_binary` per backend that takes the operation as an injected functor.

What changed

In the generic backend, the four ops now pass `std::plus` / `std::minus` / `std::multiplies` / `std::divides` into `transform_binary`. In the NEON backend they pass `vec_add` / `vec_sub` / `vec_mul` / `vec_div` wrappers around `neon_alias`, and `std::invocable<VecOp, vec_t, vec_t>` routes types without a matching vector overload (e.g. `int64` for `vmulq`) to the scalar path at compile time. This replaces the ad-hoc `vector_lane > 2` and `is_floating_point_v` guards.

Bugs fixed along the way

- Sub-lane UB in NEON: `ptr <= dest_end - N_lane` formed a pointer before the buffer when the input was shorter than one SIMD lane. The vector loop now runs a counted trip count (`blocks = (dest_end - dest) / N_lane`), which is UB-safe on sub-lane inputs and lowers to a `subs/b.ne` macro-op-fused back-edge on AArch64. The scalar remainder is inline instead of a recursive call into `generic::`. The same rewrite is applied to `check_between`.
- `check_between` diagnostic. The SIMD body checked the `>= max` mask first and only looked at `< min` if the first was empty, so a later too-large lane could hide an earlier too-small one. Both bounds are now inspected before picking the returned pointer.
- `has_vectype` typing. Declared `size_t`; retyped to `bool` to match its predicate role.

Profiling

Verified on Apple M3 Pro (clang 17, `-O3 -DNDEBUG -mcpu=apple-m1`):

- `transform_binary` inlines fully: no functor calls, no extra moves, no spills. Hot loop is 5 instr/iter with one fused control-chain macro-op (`subs/b.ne`), matching the hand-written master baseline byte-for-byte in shape.
- add/sub/mul/div × float/int32/double are within ±2% of master, well inside run-to-run noise. `mul<int64>` (scalar fallback) is byte-identical.
- `check_between` measures ~1.8× faster than master in an inlined hot scan, because the new loop shape avoids a register-shuffle the post-incrementing form was triggering under inlining pressure.

Tests

`tests/test_buffer_simd.py` pins `_simd_feature() == "NEON"` on aarch64 so a silent fallback to the scalar path cannot pass unnoticed. It then covers the int32 shape matrix (n = 1, 3, 4, 5, 8, 17) for `transform_binary`, the int64-mul SFINAE fallback, and float sub/mul/div with one block + tail. A new private `modmesh._modmesh._simd_feature()` binding exposes the runtime-detected backend.

Follow-up

`simd::check_between` has inconsistent bound semantics across paths: the NEON SIMD body treats `value == max_val` as out-of-range, while the scalar fallback accepts it. Out of scope here; left for a separate change.

Test plan

- `make gtest`
- `tests/test_buffer_simd.py` on aarch64

🤖 Generated with Claude Code